feat: per-pick confidence scores + abstention (Phase 2.4) #21
Extend the selection JSON schema to accept either the legacy
{selected_section_ids: [...]} shape or the new
{picks: [{id, confidence}]} shape with per-pick confidence in
[0.0, 1.0]. ParseSelection returns (ids, confidences, err); legacy
responses surface confidences=nil so callers can distinguish "no
confidence signal" from "all confidences low".
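The two-shape parse described above can be sketched roughly as follows. This is a minimal illustration, not the PR's `ParseSelection`: the lowercase `parseSelection`, the `pick` and `selection` structs, the string-backed `SectionID`, and the first-occurrence-wins dedup rule are all assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SectionID stands in for tree.SectionID; the string type is an assumption.
type SectionID string

// pick mirrors one entry of the new shape. Confidence is a pointer so a
// missing key can be told apart from an explicit 0.0.
type pick struct {
	ID         SectionID `json:"id"`
	Confidence *float64  `json:"confidence"`
}

// selection accepts both wire shapes in a single struct (hypothetical).
type selection struct {
	SelectedSectionIDs []SectionID `json:"selected_section_ids"`
	Picks              []pick      `json:"picks"`
}

// clamp01 forces a confidence into [0.0, 1.0].
func clamp01(v float64) float64 {
	if v < 0 {
		return 0
	}
	if v > 1 {
		return 1
	}
	return v
}

// parseSelection returns the picked IDs plus a confidence map. The map is
// nil when the model gave no confidence at all (legacy shape, or picks
// without any confidence key), which is the "no signal" case.
func parseSelection(raw []byte) ([]SectionID, map[SectionID]float64, error) {
	var sel selection
	if err := json.Unmarshal(raw, &sel); err != nil {
		return nil, nil, fmt.Errorf("parse selection: %w", err)
	}
	if len(sel.Picks) == 0 {
		return sel.SelectedSectionIDs, nil, nil // legacy shape
	}
	var ids []SectionID
	var conf map[SectionID]float64
	seen := make(map[SectionID]bool)
	for _, p := range sel.Picks {
		if seen[p.ID] {
			continue // assumption: the first occurrence of a duplicate wins
		}
		seen[p.ID] = true
		ids = append(ids, p.ID)
		if p.Confidence != nil {
			if conf == nil {
				conf = make(map[SectionID]float64)
			}
			conf[p.ID] = clamp01(*p.Confidence)
		}
	}
	return ids, conf, nil
}

func main() {
	ids, conf, err := parseSelection([]byte(`{"picks":[{"id":"s1","confidence":0.9},{"id":"s2"}]}`))
	fmt.Println(ids, conf, err)

	ids, conf, err = parseSelection([]byte(`{"selected_section_ids":["s1","s2"]}`))
	fmt.Println(ids, conf == nil, err)
}
```

Decoding `confidence` into a pointer is what lets a mixed-shape response keep picks that carry no score without inventing a value for them.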
Each strategy plumbs the confidence map through:
- SinglePass fills Result.Confidences from the parsed map, filtered
against the post-FilterKnownIDs survivors.
- ChunkedTree unions per-slice confidence maps (max-wins on duplicate
IDs across overlapping slices) and filters to the merged ID set.
- Agentic accepts both done-shape variants. The new picks shape
surfaces per-pick confidences on the final Result.
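A rough sketch of the max-wins union and the filter-to-survivors step that the strategies above describe. The helper names `mergeMaxWins` and `filterToIDs` and the string-backed `SectionID` are hypothetical stand-ins; the PR's own helpers may differ in name and signature.

```go
package main

import "fmt"

// SectionID stands in for tree.SectionID; the string type is an assumption.
type SectionID string

// mergeMaxWins unions per-slice confidence maps. When the same ID appears
// in more than one slice, the larger confidence is kept. The result stays
// nil if no slice carried any confidence.
func mergeMaxWins(maps ...map[SectionID]float64) map[SectionID]float64 {
	var out map[SectionID]float64
	for _, m := range maps {
		for id, c := range m {
			if out == nil {
				out = make(map[SectionID]float64)
			}
			if prev, ok := out[id]; !ok || c > prev {
				out[id] = c
			}
		}
	}
	return out
}

// filterToIDs drops confidences for IDs that did not survive filtering.
// A nil input stays nil so the "no signal" sentinel is preserved.
func filterToIDs(conf map[SectionID]float64, ids []SectionID) map[SectionID]float64 {
	if conf == nil {
		return nil
	}
	out := make(map[SectionID]float64, len(ids))
	for _, id := range ids {
		if c, ok := conf[id]; ok {
			out[id] = c
		}
	}
	return out
}

func main() {
	sliceA := map[SectionID]float64{"s1": 0.3, "s2": 0.8}
	sliceB := map[SectionID]float64{"s2": 0.5, "s3": 0.2}

	merged := mergeMaxWins(sliceA, sliceB)
	fmt.Println(merged)
	fmt.Println(filterToIDs(merged, []SectionID{"s2", "s3"}))
}
```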
Result.SelectedIDs stays []tree.SectionID — the change is purely
additive. Callers that don't care about confidence see no API change.
The strategy never abstains; the API layer's abstention check (next
commit) is the only place "all confidences below threshold" becomes
an abstention response.
Tests cover: new-shape parse, legacy-shape parse, mixed-shape parse
(some picks with confidence, some without), confidence clamping,
duplicate-pick dedup, per-strategy fill, chunked-tree merge, and the
agentic done-with-picks path.
…verrides
AbstainBlock carries Enabled + Below (the [0.0, 1.0] confidence
threshold below which picks count as "not confident"). When the
selection LLM returns explicit per-pick confidence and EVERY pick
falls below Below, the API layer surfaces an abstention response
instead of pretending the document held an answer.
Defaults: Enabled=true (opt-out), Below=0.4.
Env overrides: VLE_RETRIEVAL_ABSTAIN_ENABLED (truthy/falsy),
VLE_RETRIEVAL_ABSTAIN_BELOW (float in [0,1]).
Validation rejects out-of-range Below values; bad env strings
preserve the default rather than zeroing the field.
Tests cover defaults, env overrides (enable/disable/parse), edge
cases (0.0, 1.0 inclusive), bad-input rejection, and validation.
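The defaults, env overrides, and validation described in this commit could look roughly like the sketch below. Only the `AbstainBlock` name, its two fields, the defaults, and the two env variable names come from the commit message; `defaultAbstain`, `applyEnv`, `validate`, and the use of `strconv.ParseBool` to decide what counts as truthy or falsy are assumptions for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// AbstainBlock mirrors the config block named in the commit message.
type AbstainBlock struct {
	Enabled bool
	Below   float64
}

// defaultAbstain returns the documented defaults: opt-out, threshold 0.4.
func defaultAbstain() AbstainBlock {
	return AbstainBlock{Enabled: true, Below: 0.4}
}

// applyEnv overlays env values. Unparseable strings leave the current
// value untouched instead of zeroing the field. getenv is injected so the
// logic can be exercised without touching the real environment.
func (a *AbstainBlock) applyEnv(getenv func(string) string) {
	if v := getenv("VLE_RETRIEVAL_ABSTAIN_ENABLED"); v != "" {
		// Assumption: strconv.ParseBool defines the accepted spellings.
		if b, err := strconv.ParseBool(v); err == nil {
			a.Enabled = b
		}
	}
	if v := getenv("VLE_RETRIEVAL_ABSTAIN_BELOW"); v != "" {
		if f, err := strconv.ParseFloat(v, 64); err == nil {
			a.Below = f
		}
	}
}

// validate rejects thresholds outside [0.0, 1.0]; both ends are allowed.
// The negated form also rejects NaN, which ParseFloat can produce.
func (a AbstainBlock) validate() error {
	if !(a.Below >= 0 && a.Below <= 1) {
		return fmt.Errorf("retrieval.abstain.below must be in [0, 1], got %v", a.Below)
	}
	return nil
}

func main() {
	cfg := defaultAbstain()
	cfg.applyEnv(os.Getenv)
	if err := cfg.validate(); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Printf("enabled=%v below=%v\n", cfg.Enabled, cfg.Below)
}
```

Keeping validation separate from env parsing means a syntactically valid but out-of-range value such as `1.5` is still reported as an error rather than silently ignored.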
When the selection LLM returns per-pick confidences and every pick
falls strictly below retrieval.abstain.below (default 0.4), the API
layer skips the normal path and returns an abstention response:
- /v1/query → sections: [], abstained: true, abstention_reason,
  min_confidence_threshold, candidate_confidences
- /v1/answer → answer: "I cannot answer this question from the
  supplied document.", citations: [], same abstention fields;
  synthesis LLM call skipped entirely (planning + retrieval usage
  carried through)
The "all picks below" semantics is deliberate: if even one section
scored at-or-above the threshold the engine surfaces it as evidence.
Abstention is reserved for the case where every candidate is weak.
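The "all picks below" rule can be sketched as a small predicate. The name `shouldAbstain` and its argument order follow the PR's tests and sequence diagram, but the body below is an illustrative guess rather than the merged code, and the string-backed `SectionID` is an assumption.

```go
package main

import "fmt"

// SectionID stands in for tree.SectionID; the string type is an assumption.
type SectionID string

// shouldAbstain reports whether every candidate scored strictly below the
// threshold. A nil or empty map means "no confidence signal" and never
// triggers abstention; neither does a disabled config.
func shouldAbstain(conf map[SectionID]float64, below float64, enabled bool) bool {
	if !enabled || len(conf) == 0 {
		return false
	}
	for _, c := range conf {
		if c >= below {
			return false // one pick at or above the threshold counts as evidence
		}
	}
	return true
}

func main() {
	weak := map[SectionID]float64{"s1": 0.1, "s2": 0.39}
	mixed := map[SectionID]float64{"s1": 0.1, "s2": 0.4}

	fmt.Println(shouldAbstain(weak, 0.4, true))  // every pick is below 0.4
	fmt.Println(shouldAbstain(mixed, 0.4, true)) // s2 sits exactly at the threshold
	fmt.Println(shouldAbstain(nil, 0.4, true))   // legacy response, no signal
}
```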
Abstention requires explicit confidence signal — legacy-shape LLM
responses (no confidence map) always fall through to the normal
path. Per-request `enable_abstain` body field overrides the server
config; opt out globally via retrieval.abstain.enabled: false.
Other changes:
- Result.Confidences threads through the Decomposer (multi-hop
plans union confidences max-wins on overlap).
- Successful (non-abstained) responses surface a `confidences` map
on the wire when the model returned them.
- Abstention responses carry no trace_token — there is no retrieval
result to replay.
- cmd/engine wires cfg.Retrieval.Abstain into the Deps.
Tests cover: shouldAbstain predicate (all-below, one-above,
boundary, nil/empty); filterConfidencesToIDs sentinel preservation;
stringKeyedConfidences conversion; abstentionEnabled body-override
precedence; respondAbstained / respondAbstainedAnswer shape;
synthesis tripwire (LLM must not be called on abstention path);
trace_token absence on abstention.
OpenAPI:
- enable_abstain on QueryRequest + AnswerRequest.
- abstained, abstention_reason, min_confidence_threshold,
candidate_confidences, confidences on both response schemas.
Walkthrough
This PR implements confidence-driven abstention across the retrieval engine. Selection LLMs now return per-section confidence scores alongside selected IDs, and these flow through all retrieval strategies and the decomposer. The API checks whether all confidences fall below a configurable threshold and, if so, returns an abstention response (empty sections/answer) instead of weak grounding, with per-request override support.
Sequence Diagram(s)
sequenceDiagram
participant HTTPRequest
participant handleQuery
participant runSelection
participant shouldAbstain
participant respondAbstained
HTTPRequest->>handleQuery: enable_abstain override + query
handleQuery->>runSelection: retrieve sections + confidences
runSelection-->>handleQuery: (selectedIDs, confidences, usage)
handleQuery->>shouldAbstain: (confidences, threshold, enabled)
shouldAbstain-->>handleQuery: all below threshold?
alt abstain
handleQuery->>respondAbstained: shape abstention response
respondAbstained-->>HTTPRequest: 200 OK, abstained=true, empty sections
else continue
handleQuery->>handleQuery: proceed to re-rank/synthesis
handleQuery-->>HTTPRequest: 200 OK, sections/answer with confidences
end
Summary
- New `picks: [{id, confidence}]` shape carrying per-pick confidence in `[0.0, 1.0]`. The legacy `selected_section_ids` shape still parses so older / weaker models keep working.
- When every pick falls below `retrieval.abstain.below` (default `0.4`), `/v1/query` returns an abstention response (`sections: []`, `abstained: true`) and `/v1/answer` skips synthesis entirely and answers with a canonical refusal.
- Successful responses surface a `confidences` map keyed by `section_id`. Abstention responses additionally carry `abstention_reason`, `min_confidence_threshold`, and `candidate_confidences`.
Design rationale
- `Result.SelectedIDs` stays `[]tree.SectionID`; `Confidences` is a separate `map[SectionID]float64` field, omitted from JSON when empty. Callers that don't care about confidence see no API change.
- Strategies fill `Result.Confidences` if the model returned them; the abstention decision lives entirely in the API layer (`internal/api/server.go`). This keeps the strategies pure ("return what the model picked") and confines policy to one place.
- A `nil` confidence map (legacy LLM response, or new shape with no `confidence` keys populated) is the "no signal" sentinel. The abstention check returns `false` for `nil` / empty maps so older models cannot accidentally trip a refusal.
- Abstention responses carry no `trace_token` and aren't written to the replay store.
Opt-out / configuration
- `enable_abstain: false` on the `/v1/query` or `/v1/answer` request body
- `retrieval.abstain.enabled: false` in config.yaml (default: `true`, opt-out)
- `VLE_RETRIEVAL_ABSTAIN_ENABLED=false`
- `retrieval.abstain.below: 0.5` / `VLE_RETRIEVAL_ABSTAIN_BELOW=0.5` (default: `0.4`)
Test plan
- `go build ./...` clean
- `go vet ./...` clean
- `go test ./...` all green (all pre-existing tests pass + new coverage)
- Per-strategy fill of `Confidences`, strategies never abstain
- `shouldAbstain` predicate, helper sentinels, `respondAbstained` shape (query + answer), synthesis LLM tripwire (must not be called on abstention path), `trace_token` absent on abstention
Before / after examples
LLM new-shape response → confidences populated
LLM all-low response → abstained
Mixed-shape response handled
Legacy response → no abstention
Summary by CodeRabbit
Release Notes
New Features
- `enable_abstain` parameter on query and answer endpoints for per-request abstention override control.
Configuration
- `retrieval.abstain` configuration block with `enabled` toggle and `below` confidence threshold (default: 0.4, range: 0.0–1.0).